-
-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore freestanding ellipses #235
Conversation
What about things like this:
I'm probably doing it wrong, but what is the defined behavior if the ellipses is between other USFM markers? |
Should this happen before tokenization? I'm not quite understanding. |
This seems like the wrong place to change it - this was added to solve a different issue. Does this change cause the tests to fail? It likely should... |
Previously, johnml1135 (John Lambert) wrote…
This may work - but I don't understand this part of the parse to know where all the places that it will work. I also don't know how paratext populates \q1, and other poetry markers - nor to I know how this may handle populating section headers etc. with ... if that is an issue. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 3 files at r1, all commit messages.
Reviewable status: 1 of 3 files reviewed, 3 unresolved discussions (waiting on @ddaspit and @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To respond to all your comments in one place, @johnml1135, my goal was to treat verse's whose only text is '...' the same as white-space-only verses. This might be a hacky solution, but I'm not sure what the less hacky solution would be. I could certainly define a regex or something if we're interested in potentially treating other text as white space (e.g., only punctuation or something). Or maybe there's a way to think of this as a special kind of usfm token, but I also don't want to go overboard just for this small issue.
Judging from the project I was working with in which I originally saw this issue, it appears that '...' only appears as filler for an entire verse text - i.e., I didn't see examples of multiple poetic lines (or something similar) each with '...', but I'm not sure if those might exist as well since, like I said, I'm just basing this off one project.
Reviewable status: 1 of 3 files reviewed, 3 unresolved discussions (waiting on @ddaspit)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are trying to handle the specific case where Paratext inserts ellipses to fill an empty verse. I think you should narrow the check even further and only apply it when you are creating text rows for a verse, i.e. UsfmTextBase.EndVerseText
. I want to avoid incorrectly interpreting the intentional use of ellipses as empty text.
Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @Enkidu93)
I agree that changing the code there would prevent partial matches. @ddaspit - do you know all the instances in which the ellipsis might be added in Paratext? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like this, @ddaspit ?
Yeah, it would be nice to know what Paratext's behavior is such that we can make appropriately narrow and also so that we can have reasonable tests.
Reviewable status: 1 of 3 files reviewed, 3 unresolved discussions (waiting on @ddaspit)
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #235 +/- ##
=======================================
Coverage 69.80% 69.81%
=======================================
Files 379 379
Lines 31478 31482 +4
Branches 4399 4400 +1
=======================================
+ Hits 21974 21978 +4
Misses 8483 8483
Partials 1021 1021 ☔ View full report in Codecov by Sentry. |
I think a simple comment would be appropriate here - what are we doing and why? "Paratext sometimes auto-adds an ellipses for newly instantiated verses. This should be ignored for NLP tasks." or similar. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me check on the Paratext behavior and get back to you.
Reviewed 1 of 2 files at r2.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't figure out where the ellipses are coming from. I thought they were something that Paratext would insert in some cases, but I can't find any evidence of that. They might be coming from Adapt It, but I'm not sure how the ellipses would get back into the source USFM from Adapt It. If this isn't standard Paratext behavior, we shouldn't handle this in Machine. Instead, we should handle this during preprocessing in Serval.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can ask Michael about it tomorrow (?), @ddaspit, before committing to that switch. He seemed fairly sure that Paratext created the ellipses when the team made an empty book and my impression was that this was typical.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I spoke to Michael about it. He doesn't know where they come from. I actually looked at the Paratext repo and found no trace of any code that adds ellipses. I looked at a couple of projects that contain them and can't find a consistent pattern.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so what's the plan here. Do you want to relocate this logic? What do you mean by preprocessing in Serval? There isn't really any file preprocessing is there, right?
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to filter out any segments that are ...
when generating the training corpus and pretranslation JSON file in PreprocessBuildJob.WriteDataFilesAsync
.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, just making sure I wasn't missing something 😆
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @Enkidu93)
Moving filtration to serval: sillsdev/serval#469 |
Fixes sillsdev/serval#458.
I think this is the simplest solution. Let me know what you think.
This change is